Abstract
Background:
The growing accessibility of general-purpose large language models (LLMs) has prompted increased utilization by both patients and healthcare professionals for clinical inquiries. Faced with complex or unresolved clinical issues, physicians, particularly those with less specialized expertise, may supplement traditional resources such as clinical guidelines, UpToDate, and consultations with senior colleagues by consulting LLMs. However, general-purpose LLMs are known to be susceptible to limitations including hallucinations, reliance on outdated or biased training data, and potential data contamination, which can lead to inaccurate or irrelevant responses and pose risks to patient safety. Consequently, the integration of LLMs into clinical workflows necessitates rigorous evaluation to ensure reliability, safety, and ethical alignment.
Previous studies have assessed LLM performance using publicly available databases (e.g., MIMIC), standardized medical examinations (e.g., USMLE), and published case reports. However, there remains a paucity of research evaluating LLM performance in the context of real-world clinical scenarios. Furthermore, to our knowledge, no studies have specifically evaluated the performance of LLMs in addressing questions related to the diagnosis and treatment of lymphoma. Given the inherent complexity of lymphoma management, we undertook an evaluation of LLM capabilities using clinical questions derived from authentic lymphoma-related scenarios.
Methods:
LLM evaluation questions were designed by two senior professors, each with over 30 years of clinical experience and advanced professional titles, drawing on their extensive clinical expertise. The question set comprised 55 frequently encountered clinical scenarios spanning three pivotal domains (diagnosis, treatment, and patient education) and encompassing common lymphoma subtypes.
Two independent researchers posed the same prompts to Deepseek-R1 (DS), ChatGPT-4o, Claude 3.5, and Llama 3.7. Five lymphoma specialists at different centers, each with over a decade of experience, assessed the LLMs' responses using a 5-point Likert scale across four dimensions of professional quality: accuracy, relevance, applicability, and potential harm. Answer readability was evaluated by quantifying word count and calculating Flesch Reading Ease and Flesch-Kincaid Grade Level scores.
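For reference, the two readability metrics follow the standard Flesch formulas (assuming the conventional definitions were applied; the specific scoring tool is not stated here):

\[ \text{Flesch Reading Ease} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]
\[ \text{Flesch-Kincaid Grade Level} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 \]

Higher Flesch Reading Ease values correspond to easier text, whereas lower Flesch-Kincaid Grade Level values correspond to easier text.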
Results:
Regarding overall professional quality, DS achieved the highest score, followed by GPT, Claude, and Llama, with respective scores of 4.65, 4.53, 4.30, and 4.16. In terms of accuracy, the models scored 4.79, 4.54, 4.32, and 3.67, respectively. Relevance scores were 4.21, 4.15, 4.01, and 3.97, while applicability scores were 4.67, 4.52, 4.41, and 4.18. Llama exhibited the highest potential-harm score (1.78). Subgroup analyses by disease area and pathological subtype mirrored the overall scoring trends. In terms of readability, no statistically significant differences were observed in word count across the four LLMs; however, Claude's responses demonstrated the lowest Flesch Reading Ease and Flesch-Kincaid Grade Level scores among the models.
Conclusion:
In the realm of clinical applications for lymphoma, Deepseek-R1 demonstrated superior accuracy and overall professional quality, and it performed best across multiple domains, signifying its potential for clinical application. The Claude model showed the highest readability, which may make it more user-friendly for non-native English speakers. However, given the potential for general-purpose LLMs to generate harmful responses, clinicians should exercise caution when utilizing these tools. These findings underscore the importance of developing specialized medical language models tailored to complex diseases such as lymphoma, in order to mitigate risks and optimize clinical utility.